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1  Kalman  filter  equations 


Abstract 

Using  known  camera  motion  to  estimate  depth  from  image  sequences  is  important  in  robotics 
applications  such  as  navigation  and  manipulation.  For  many  applications,  having  an  on-line, 
incremental  estimate  of  depth  is  important.  To  permit  the  blending  of  new  measurements  with 
old  estimates,  it  is  essential  that  the  representation  include  not  only  the  current  depth  estimate, 
but  also  an  estimate  of  the  current  uncertainty.  Kalman  filtering  provides  the  needed  framework 
to  integrate  new  measurements  and  reduce  the  uncertainty  over  time.  Previous  applications  of 
Kalman  filtering  to  depth  from  motion  have  been  limited  to  the  estimation  of  depth  at  the  location 
of  a  sparse  set  of  features.  In  this  paper,  we  introduce  a  new  pixel-based  ( iconic )  algorithm  that 
estimates  depth  from  an  image  sequence  and  incrementally  refines  its  estimate  over  time.  We  also 
present  a  feature-based  version  of  the  algorithm  which  is  used  for  comparison.  We  compare  the 
performance  of  both  approaches  mathematically,  with  quantitative  experiments  using  images  of  a 
flat  scene,  and  with  qualitative  experiments  using  images  of  a  realistic  outdoor  scene  model.  The 
results  show  that  the  method  is  an  effective  way  to  extract  depth  from  lateral  camera  translations. 
Our  approach  can  be  extended  to  incorporate  general  motion  and  to  integrate  other  sources  of 
information  such  as  stereo.  The  algorithms  which  we  have  developed,  which  combine  Kalman 
filtering  with  iconic  descriptions  of  depth,  can  thus  serve  as  a  useful  and  general  framework  for 
low-level  dynamic  vision. 


1  Introduction 


Using  known  camera  motion  to  estimate  depth  from  image  sequences  is  important  in  many 
robotics  applications  such  as  navigation  and  manipulation.  Depth  from  motion  can  also  be  an 
important  element  of  a  multi-modal  sensing  strategy,  and  can  be  used  to  guide  stereo  matching. 
For  many  applications,  having  an  on-line,  incremental  estimate  of  depth  is  important.  To  develop 
such  an  incremental  algorithm,  it  is  essential  that  the  representation  include  not  only  the  current 
depth  estimate,  but  also  an  estimate  of  the  current  uncertainty. 

Previous  work  [Broida86]  [Faugeras86]  [Hallam83]  [Matthies87c]  [Matthies87b]  [Rives86] 
has  identified  Kalman  filtering  as  a  viable  framework  for  this  problem,  since  it  incorporates 
representations  of  uncertainty  in  the  depth  model  and  provides  a  mechanism  for  incrementally 
reducing  this  uncertainty  over  time.  To  date,  this  framework  has  largely  been  restricted  to 
estimating  the  positions  of  a  sparse  set  of  trackable  features  such  as  points  or  line  segments. 
While  this  is  adequate  for  many  robotics  applications,  it  requires  reliably  extracting  features,  and 
it  fails  to  describe  large  areas  of  the  image.  Another  line  of  previous  research  has  addressed  the 
problem  of  extracting  dense  displacement  or  depth  estimates  from  image  sequences.  However, 
these  previous  approaches  have  either  been  restricted  to  two  frame  analysis  [Anandan85],  or 
have  used  batch  processing  of  the  image  sequence  (using  the  epipolar  plane  method  [Bolles87] 
or  spatio-temporal  filtering  [Heeger87]). 

In  this  paper  we  introduce  a  new  image-based  ( iconic )  approach  to  incremental  depth  esti¬ 
mation  and  compare  it  mathematically  and  experimentally  to  a  feature-based  approach  we  have 
used  previously  [Matthies87b].  This  approach  represents  depth  and  depth  variance  at  every  pixel 
and  uses  Kalman  filtering  to  extrapolate  and  update  these  pixel-based  maps.  The  algorithm  uses 
correlation  to  measure  the  optical  flow  and  to  estimate  the  variance  in  the  flow,  and  converts 
the  flow  field  to  a  depth  map  using  the  known  camera  motion.  It  then  uses  the  Kalman  filter 
to  appropriately  weight  both  the  new  measurements  and  prior  estimates  of  depth  to  generate 
an  updated  depth  map.  Regularization  is  employed  to  smooth  the  depth  map  and  to  fill  in  the 
underconstrained  areas.  The  resulting  algorithm  is  parallel  and  uniform,  and  can  take  advantage 
of  mesh-connected  or  multi-resolution  (pyramidal)  processing  architectures. 

The  remainder  of  this  paper  is  structured  as  follows.  In  the  next  section,  we  give  a  brief 
review  of  Kalman  filtering,  and  introduce  our  overall  approach  to  Kalman  filtering  of  depth. 
Next,  we  review  the  equations  of  motion,  present  a  simple  camera  model,  and  examine  the 
potential  accuracy  of  the  method  by  analyzing  its  sensitivity  to  the  direction  of  camera  motion. 
We  then  describe  our  new  iconic  incremental  depth  from  motion  algorithm.  This  is  followed 
by  a  description  of  the  feature-based  incremental  depth  from  motion  algorithm  that  is  used  for 
comparison.  The  theoretical  accuracy  of  these  two  methods  is  then  derived  and  compared  to 
that  of  stereo  matching.  This  analysis  is  verified  experimentally  by  using  images  of  a  flat  scene. 
We  then  show  the  performance  of  both  methods  on  images  of  realistic  outdoor  scene  models. 


In  the  final  section,  we  discuss  the  promise  and  the  problems  involved  in  extending  the  method 
to  arbitrary  motion.  We  also  conclude  that  the  ideas  and  results  presented  apply  directly  to  the 
much  broader  problem  of  integrating  depth  information  from  multiple  sources. 

2  Estimation  framework 

The  depth  from  motion  algorithms  described  in  this  paper  use  a  sequence  of  images  taken  with 
small  inter-frame  displacements  [Bolles87].  The  advantage  of  using  such  a  sequence  is  that  the 
correspondence  problem  between  two  successive  images  is  reduced.  The  disadvantage  is  that 
the  individual  depth  measurements  are  less  precise  because  of  the  very  small  baselines  involved. 
To  overcome  this  latter  problem,  information  from  each  pair  of  frames  must  be  integrated  over 
time.  For  many  robotics  applications  it  is  desirable  to  process  these  images  using  a  real-time 
rather  than  batch  process,  with  an  updated  depth  estimate  being  generated  after  each  new  image 
is  acquired.  The  incremental  algorithm  also  has  the  advantage  of  requiring  less  storage,  since 
only  the  current  estimate  and  its  uncertainty  model  are  required. 

A  powerful  technique  for  doing  real-time  estimation  of  such  dynamic  systems  is  the  Kalman 
filter.  This  formulation  allows  for  the  integration  of  information  over  time,  and  is  robust  with 
respect  to  both  system  and  sensor  noise.  The  notation  and  equations  of  the  Kalman  filter  are 
presented  first,  along  with  a  simple  example.  The  application  of  this  framework  to  motion 
sequence  processing  is  then  sketched,  discussing  those  parts  that  are  common  to  both  iconic  and 
feature  based  algorithms  (the  details  of  these  algorithms  are  in  Sections  4  and  5,  respectively). 

2.1  Kalman  filter 

The  Kalman  filter  is  a  Bayesian  estimation  technique  used  to  track  stochastic  dynamic  systems 
being  observed  with  noisy  sensors.  The  filter  is  based  on  three  separate  probabilistic  models,  as 
shown  in  Table  1.  The  first  model,  the  system  model,  describes  the  evolution  over  time  of  the 
current  state  vector  u,.  The  transition  between  states  is  characterized  by  the  known  transition 
matrix  and  the  addition  of  Gaussian  noise  with  a  covariance  Qt.  The  second  model,  the 
measurement  (or  sensor)  model,  relates  the  measurement  vector  d,  to  the  current  state  through  a 
measurement  matrix  Ht  and  the  addition  of  Gaussian  noise  with  a  covariance  R,.  The  third  model, 
the  prior  model,  describes  the  knowledge  about  the  system  state  Uq  and  its  covariance  P0  before 
the  first  measurement  is  taken.  The  sensor  and  process  noise  are  assumed  to  be  uncorrelated. 

To  explain  the  above  equations,  we  will  use  the  example  of  a  ping-pong  playing  robot  which 
must  track  a  moving  ball.  In  this  example,  the  state  consists  of  the  ball  position  and  velocity, 
u  =  [x  y  z  x  y  z  l]r,  where  x  and  y  lie  parallel  to  the  image  plane  (y  is  up),  and  z  is  parallel  to 
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system  model  u,  =  +  rjt,  r/t~N(0,Qt) 

measurement  model  d,  =  H,u,  +  £,  ~  N(0,Rt) 

prior  model  E[uo]  =  Uq ,  Cov[uo]  =  Po 

(other  assumptions)  E[r)t£f]  =  0 _ 

Prediction  state  estimate  extrapolation  iij 

phase _ state  covariance  extrapolation  P~  =  +  Q,-  \ 

state  estimate  update  u*  =  u~  +Kt[d,  —  H,u~  ] 

state  covariance  update  P+t  =  [/  —  K,H,]P~ 

Kalman  gain  matrix  K,  =  P~Hf[H,P~Hf  +  R,]~l 

Table  1:  Kalman  filter  equations 
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the  optical  axis.  The  state  transition  matrix  models  the  ball  dynamics,  for  example 
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where  At  is  the  time  step,  /3  is  the  coefficient  of  friction  and  g  is  gravitational  acceleration. 
The  process  noise  matrix  Qt  models  the  random  disturbances  that  influence  the  trajectory.  If  we 
assume  that  the  camera  uses  orthographic  projection,  and  uses  a  simple  algorithm  to  find  the 
“center  of  mass”  (x,  y)  of  the  ball,  the  sensor  can  then  be  modeled  by 

[1  000000‘ 

W‘~[o  1  0  0  0  0  0  ‘ 

The  uncertainty  in  the  sensed  ball  position  can  be  modeled  by  a  2  x  2  covariance  matrix  R,. 

Once  the  system,  measurement  and  prior  models  (upper  third  of  Table  1)  have  been  specified, 
the  Kalman  filter  algorithm  follows  from  the  formulation  in  the  lower  two  thirds  of  Table  1. 
The  algorithm  operates  in  two  phases:  extrapolation  (prediction)  and  update  (correction).  The 
previous  state  estimate  u is  used  to  predict  the  current  state  u~.  At  the  same  time,  the 
previous  state  covariance  P\_  x  is  extrapolated  to  the  predicted  state  covariance  Pj .  This  predicted 
covariance  is  used  to  compute  the  new  Kalman  gain  matrix  K,  and  the  updated  covariance  matrix 
P*t.  Finally,  the  measurement  residual  d,  —  H,u~  is  weighted  by  the  gain  matrix  K,  and  added 
to  the  predicted  state  u~  to  yield  the  updated  state  u/.  A  block  diagram  for  the  Kalman  filter  is 
given  in  Figure  1. 


Figure  1 :  Kalman  filter  block  diagram 


2.2  Application  to  depth  from  motion 

To  apply  the  Kalman  filter  estimation  framework  to  the  depth  from  motion  problem,  we  specialize 
each  of  the  three  models  (system,  measurement  and  prior)  and  define  the  implementations  of 
the  extrapolation  and  update  stages.  This  section  briefly  previews  how  these  components  are 
chosen  for  the  two  depth  from  motion  algorithms  described  in  this  paper.  The  details  of  the 
implementation  are  left  to  Sections  4  and  5. 

The  first  thing  to  specify  when  designing  the  Kalman  filter  is  the  representation  used  for  the 
state  vector.  For  the  iconic  depth  from  motion  algorithm,  the  state  is  a  depth  map,  where  the 
value  of  depth  at  each  point  in  the  current  image  is  estimated1.  For  the  feature  based  approach, 
the  three  dimensional  location  of  each  feature  (in  our  case  edge  element)  is  estimated.  For 
both  methods,  an  uncertainty  map  is  estimated  and  propagated2.  For  the  iconic  approach,  the 
measurement  noise  can  be  spatially  varying  due  to  local  contrast  in  the  image.  For  the  feature 
based  approach,  the  accuracy  of  edge  positions  may  also  vary.  Thus  for  both  methods,  the  initial 
measurement  stage  produces  not  only  a  depth  measurement,  but  also  an  associated  variance. 

The  extrapolation  stage  for  the  two  approaches  shares  the  same  motion  equations  (see  Section 
3.1),  but  differs  because  of  the  underlying  representation.  For  the  iconic  method,  the  map  is 
warped  to  predict  what  it  will  look  like  in  the  next  frame,  and  resampled  to  keep  it  iconic.  For 

1  In  our  actual  implementation,  inverse  depth  (called  “disparity”)  is  used.  See  Section  4. 

2In  the  usual  Kalman  filter  implementation,  the  covariance  of  the  measurement  noise  is  known  in  advance,  as 
are  the  system  and  measurement  models,  so  that  the  gain  matrix  K,  can  be  pre-computed. 


the  feature  based  method,  the  three  dimensional  position  of  the  features  is  extrapolated. 

Finally,  the  prior  model  can  be  used  to  embed  prior  knowledge  about  the  scene.  In  par¬ 
ticular,  smoothness  constraints  (that  require  nearby  points  to  have  similar  disparity)  can  easily 
be  integrated  into  the  iconic  method,  and  can  be  used  to  reduce  the  noisy  nature  of  the  flow 
estimates.  For  the  edge  tracking  approach,  figural  continuity  [Mayhew81]  [Ohta85]  could  be 
used  (i.e.  connected  edges  must  match  connected  edges),  but  this  is  currently  not  used. 

3  Motion  equations  and  camera  model 

Our  system  and  measurement  models  are  based  on  the  equations  relating  scene  depth  and  camera 
motion  to  the  induced  image  flow.  In  this  section,  we  review  these  equations  for  an  idealized 
camera  (focal  length  =  1)  and  show  how  to  use  a  simple  calibration  model  to  relate  the  idealized 
equations  to  real  cameras.  We  also  derive  an  expression  for  the  relative  uncertainty  in  depth 
estimates  obtained  from  lateral  versus  forward  camera  translation.  This  expression  shows  con¬ 
cretely  the  effects  of  camera  motion  on  depth  uncertainty  and  reinforces  the  need  for  modeling 
the  uncertainty  in  computed  depth. 

3.1  Equations  of  motion 

If  the  inter-frame  camera  motion  is  sufficiently  small,  the  resulting  optical  flow  can  be  expressed 
to  a  good  approximation  in  terms  of  the  instantaneous  camera  velocity  [Longuet80],  [Bruss83], 
[Waxman86].  We  will  specify  this  in  terms  of  a  translational  velocity  T  and  an  angular  velocity 
R.  In  the  camera  coordinate  frame  (Figure  2),  the  motion  of  a  3-D  point  P  is  described  by  the 
equation 

dP 

=  -T  -  R  x  P. 
dt 

Expanding  this  into  components  yields 

dX/dt  =  —Tx  —  RyZ  +  RZY 

dY/dt  =  -Ty-  RzX  +  RxZ  (1) 

dZ/dt  =  -Tz-  RxY  +  RyX. 

Now,  projecting  (X,  Y ,  Z)  onto  an  ideal,  unit  focal  length  image. 
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camera  axis. 


P  =  (X,  Y,Z) 


Figure  2:  Camera  model 
CP  is  the  center  of  projection 


taking  the  derivatives  of  ( x ,  y)  with  respect  to  time,  and  substituting  in  from  equation  (2)  leads 
to  the  familiar  equations  of  optical  flow  [Waxman86] 
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xy 
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These  equations  relate  the  depth  Z  of  the  point  to  the  camera  motion  T,  R  and  the  induced  image 
displacements  or  optical  flow  [Ax  Ay]T.  We  will  use  these  equations  to  measure  depth,  given 
the  camera  motion  and  optical  flow,  and  to  predict  the  change  in  the  depth  map  between  frames. 

3.2  Camera  model 

Relating  the  ideal  flow  equations  to  real  measurements  requires  a  camera  model.  If  optical 
distortions  are  not  severe,  a  pin-hole  camera  model  will  suffice.  In  this  paper  we  adopt  a  model 
similar  to  that  originated  by  Sobel  [Sobel74]  (Figure  2).  This  model  specifies  the  origin  (cx,  cy)  of 
the  image  coordinate  system  and  a  pair  of  scale  factors  ( sx ,  sy)  that  combine  the  focal  length  and 
image  aspect  ratio.  Denoting  the  actual  image  coordinates  with  a  subscript  “a”,  the  projection 


onto  the  actual  image  is  summarized  by  the  equation 
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(3) 


C  is  the  known  as  the  collimation  matrix.  Thus,  the  ideal  image  coordinates  (x,y)  are  related  to 
the  actual  image  coordinates  by 


xa  =  sxx  +  cx 
ya  =  syy  +  cr 

Equations  in  the  balance  of  the  paper  will  primarily  use  ideal  image  coordinates  for  clarity. 
These  equations  can  be  re-expressed  in  terms  of  actual  coordinates  using  the  transformations 
above. 


3.3  Sensitivity  analysis 

Before  describing  our  Kalman  filter  algorithms,  we  will  analyze  the  effect  of  different  camera 
motions  on  the  uncertainty  in  depth  estimates.  Given  specific  descriptions  of  real  cameras  and 
scenes,  we  can  obtain  bounds  on  the  estimation  accuracy  of  depth-from-motion  algorithms  using 
perturbation  or  covariance  analysis  techniques  based  on  first-order  Taylor  expansions  [Wertz78]. 
For  example,  if  we  solve  the  motion  equations  for  the  inverse  depth  d  in  terms  of  the  optical 
flow,  camera  motion,  and  camera  model, 

d  =  F(Ax,  Ay;  T ,  R;  cx,  cy,  sx,  sy),  (4) 

then  the  uncertainty  in  depth  arising  from  uncertainty  in  flow,  motion,  and  calibration  can  be 
expressed  by 

6d  =  Jf6f  +  Jm6m  +  Jc8c,  (5) 

where  J/ ,  Jm,  and  Jc  are  the  Jacobians  of  (4)  with  respect  to  the  flow,  motion,  and  calibration 
parameters,  respectively,  and  8f,  8m,  and  8c  are  perturbations  of  the  respective  parameters.  We 
will  use  this  methodology  to  draw  some  concrete  conclusions  about  the  relative  accuracy  of 
depth  estimates  obtained  from  different  classes  of  motion. 

It  is  well  known  that  camera  rotation  provides  no  depth  information.  Furthermore,  for  a 
translating  camera,  the  accuracy  of  depth  estimates  increases  with  increasing  distance  of  image 
features  from  the  focus  of  expansion  (FOE),  the  point  in  the  image  where  the  translation  vector 
(T)  pierces  the  image.  This  implies  that  the  ‘best’  translations  are  parallel  to  the  image  plane 
and  that  the  ‘worst’  are  forward  along  the  camera  axis.  A  lengthy  examination  of  the  effects 
of  measurement  uncertainty  in  depth  from  motion  is  given  in  [Snyder87];  here  we  will  give  a 
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shorter  derivation  that  demonstrates  the  relative  accuracy  obtainable  from  forward  and  lateral 
camera  translation. 

For  clarity  we  consider  only  one-dimensional  flow  induced  by  translation  along  the  X  or  Z 
axes.  For  an  ideal  camera,  lateral  motion  induces  the  flow 


Ax,= 


ill 

Z  ’ 


whereas  forward  motion  induces  the  flow 


Axf  = 


xTz 


The  inverse  depth  (or  disparity)  in  each  case  is 


(6) 

(7) 


1  _  —Ax i 

z  =  — 

Axf_ 
xT2  ' 

Therefore,  perturbations  of  Sxt  and  Sx/  in  the  flow  measurements  Axt  and  Axf  yield  the  following 
perturbations  in  the  disparity  estimates: 


d,  = 
df  = 


Sdi 

8df 


8xi 

W3 

6xf 

Wz 


These  equations  give  the  error  in  the  inverse  depth  as  a  function  of  the  error  in  the  measured 
image  displacement,  the  amount  of  camera  motion,  and  position  of  the  feature  in  the  field  of  view. 
Since  we  are  interested  in  comparing  forward  and  lateral  motions,  a  good  way  to  visualize  these 
equations  is  to  plot  the  relative  depth  uncertainty,  Sdf/Sdi.  Assuming  that  the  flow  perturbations 
6xi  and  Sx/  are  equal,  the  relative  uncertainty  is 


6df  _  6xf/\xTz\  _  Tx_ 
8dt  Sx,/\TX\  xTz 


(8) 


The  image  coordinate  x  indicates  where  the  object  appears  in  the  field  of  view.  Figure  3  shows 
that  x  equals  the  tangent  of  the  angle  9  between  the  object  and  the  camera  axis.  The  formula 
for  the  relative  uncertainty  is  thus 

6df  _  ~ 


Sdi 


T,  tan  6 


(9) 


Figure  3:  Angle  between  object  and  camera  axis  is  0 


This  relationship  is  plotted  in  Figure  4  for  Tx  =  Tz.  At  45  degrees  from  the  camera  axis, 
depth  uncertainty  is  equal  for  forward  and  lateral  motions.  At  18  degrees,  which  is  the  edge  of 
the  image  for  the  experiments  in  Section  6.2,  the  ratio  of  uncertainties  is  3.1;  at  9  degrees,  the 
ratio  is  6.3.  Thus,  the  accuracy  of  depth  extracted  from  forward  motion  is  effectively  unusable 
for  a  large  part  of  the  image.  An  alternative  interpretation  for  this  curve  is  that  it  expresses 
the  relative  precision  of  stereo  and  depth-from-motion  in  a  motion  stereo  system.  By  setting 
8 d.f  / Sdt  =  1,  equation  (9)  also  expresses  the  relative  distances  the  camera  must  move  forward 
and  laterally  to  obtain  equally  precise  depth  estimates. 

We  draw  several  conclusions  from  this  analysis.  First,  it  underscores  the  value  of  representing 
depth  uncertainty  as  we  describe  in  the  following  sections.  Second,  for  practical  depth  estimation, 
forward  motion  is  effectively  unusable  compared  with  lateral  motion.  Finally,  we  can  relate  these 
results  to  motion  stereo  by  noting  that  depth  from  forward  motion  will  be  of  little  value  in  a 
motion  stereo  system. 

4  Iconic  depth  estimation 

This  section  describes  the  incremental  (on-line)  iconic  depth  estimation  algorithm  that  we  have 
developed.  The  algorithm  processes  each  new  image  as  it  arrives,  extracting  optic  flow  at  each 
pixel  using  the  current  and  previous  intensity  images,  and  then  integrates  this  new  information 
with  the  current  depth  estimate. 


raw 


cumulative  smoothed 


Figure  5:  Iconic  depth  estimation  block  diagram 

The  algorithm  consists  of  four  main  stages  (Figure  5).  The  first  stage  uses  correlation  to 
compute  an  estimate  of  the  displacement  vector  and  its  associated  covariance.  It  converts  this 
estimate  into  a  disparity  (inverse  depth)  measurement  using  the  known  camera  motion.  The 
second  stage  integrates  this  information  with  the  disparity  map  predicted  at  the  previous  time 
step.  The  third  stage  uses  regularization-based  smoothing  to  reduce  measurement  noise  and  to 
fill  in  areas  of  unknown  disparity.  The  last  stage  uses  the  known  camera  motion  to  predict  the 
disparity  field  that  will  be  seen  in  the  next  frame,  and  re- samples  the  field  to  keep  it  iconic  (pixel 
based). 

4.1  Measuring  disparity 

The  first  stage  of  the  Kalman  filter  computes  a  disparity  map  from  the  difference  in  intensity 
between  the  current  image  and  the  previous  image.  This  computation  proceeds  in  two  parts. 
First,  a  two-dimensional  displacement  (or  optic  flow)  vector  is  computed  at  each  point  using 
a  correlation  based  algorithm.  The  uncertainty  in  this  vector  is  characterized  by  a  bivariate 
Gaussian  distribution.  Second,  this  vector  is  converted  into  a  disparity  measurement  using  the 
known  camera  motion  and  the  motion  equations  developed  in  Section  3.1. 

This  two  stage  formulation  is  desirable  for  several  reasons.  First,  it  allows  probabilistic 
characterizations  of  uncertainty  in  flow  to  be  translated  into  a  probabilistic  characterization  of 
the  uncertainty  in  disparity.  This  especially  valuable  if  the  camera  motion  is  also  uncertain,  since 
the  equations  relating  flow  to  disparity  can  be  extended  to  model  this  as  well  [Rives86].  Second, 
it  requires  only  that  we  can  characterize  the  level  of  uncertainty  in  the  flow,  and  allows  us  to 


evaluate  the  potential  accuracy  of  the  algorithm  independent  of  how  flow  is  obtained.  Finally, 
bivariate  Gaussian  distributions  can  capture  the  distinctions  between  knowing  zero,  one,  or  both 
components  of  flow  [Anandan85],[Nagel86],[Heeger87],  and  can  thus  subsume  the  notion  of  the 
aperture  problem. 

Let  us  turn  first  to  the  problem  of  extracting  optical  flow  from  a  sequence  of  intensity 
images,  which  has  been  extensively  studied  in  computer  vision.  Early  approaches  used  the  ratio 
of  the  spatial  and  temporal  image  derivatives  [Hom81],  while  more  recent  approaches  have  used 
correlation  between  images  [Anandan85]  or  spatio-temporal  filtering  [Heeger87].  In  this  paper 
we  use  a  simple  version  of  correlation-based  matching.  This  technique,  which  has  been  called 
the  Sum  of  Squared  Differences  (SSD)  method  [Anandan85],  integrates  the  squared  intensity 
difference  between  two  shifted  images  over  a  small  area  to  obtain  an  error  measure 

et(Ax,Ay;x,y)  -  f  J  w(\,ri)[ft(x  -  Ax  + \,y  -  Ay +  rj)  -  f_{(x  + \,y +  rj)]2  d\dr), 

where  f  and  f-\  are  the  two  intensity  images,  and  w(A,  rj)  is  a  weighting  function.  The 
SSD  measure  is  computed  at  each  pixel  for  a  number  of  possible  flow  values.  In  Anandan’s 
algorithm,  a  coarse-to-fine  technique  is  used  to  limit  the  range  of  possible  flow  values.  In  our 
images  the  possible  range  of  values  is  small  (since  we  are  using  small-motion  sequences),  so  a 
single-resolution  algorithm  suffices3.  The  resulting  error  surface  e,(Ax,  Ay;x,y )  is  approximately 
parabolic  in  shape.  The  lowest  point  of  this  surface  defines  the  flow  measurement  and  the  shape 
of  the  surface  defines  the  bivariate  covariance  of  the  measurement. 

To  convert  the  displacement  vector  [Ax  Ay]r  into  a  disparity  measurement,  we  assume  that 
the  camera  motion  (T,  R)  is  given.  The  optical  flow  equation  (2)  can  then  be  used  to  estimate 
depth  as  follows.  First  we  abbreviate  (2)  to 


where  d  is  the  inverse  depth  and  £  is  an  error  vector  representing  noise  in  the  flow  measurement. 
The  noise  £  is  assumed  to  be  bivariate  Gaussian  random  vector  with  a  zero  mean  and  a  covariance 
Pm  computed  by  the  flow  estimation  part.  Equation  10  can  be  re-expressed  in  the  following 
standard  form  for  linear  estimation  problems: 

4x=[  ^1  “  f  r'l  =  4 H+«=H</  +  ?-  (U) 

^  y  ry  J  ly 

The  optimal  estimate  of  the  disparity  d  is  then  [Maybeck79] 

d  =  (HTp-'H)-lHTP~lAx  (12) 

3  It  may  be  necessary  to  use  a  larger  search  range  at  first,  but  once  the  estimator  has  “latched  on”  to  a  good 
disparity  map,  the  predicted  disparity  and  disparity  variance  can  be  used  to  limit  the  search  by  computing  confidence 
intervals. 


Figure  6:  Parabolic  fit  to  SSD  error  surface 


and  the  variance  of  this  disparity  measurement  is 

*l  =  WrP-'H)-'.  (13) 

The  measurement  process  described  in  this  section  has  been  implemented  in  a  simplified 
form,  under  the  assumption  that  the  flow  is  parallel  to  the  image  raster.  Each  scanline  of  two 
successive  images  is  magnified  by  a  factor  of  4  by  cubic  interpolation.  The  SSD  measure  ek  is 
computed  at  each  interpolated  sub-pixel  displacement  v*  using  a  5  x  5  [pixel]  square  window. 
The  minimum  error  (vj,  e-k)  is  found  and  a  parabola 

e(y)  =  ov2  +  bv  +  c 

is  fit  to  this  point  and  its  two  neighbors  ek_{)  and  (vf+1,  ek+l)  (Figure  6).  The  minimum  of 
this  parabola  establishes  the  flow  estimate  (to  sub-sub-pixel  precision).  Appendix  A  shows  that 
the  variance  of  the  flow  measurement  is 


2-2 

Var(e)  =  -A 
a 


where  <72  is  the  variance  of  the  image  noise  process.  The  appendix  also  shows  that  adjacent  flow 
estimates  are  correlated  over  both  space  and  time;  the  significance  of  this  fact  will  be  considered 
in  Section  6.1. 


4.2  Updating  the  disparity  map 

The  next  stage  in  the  iconic  depth  estimator  is  the  integration  of  the  new  disparity  measurements 
with  the  predicted  disparity  map  (this  step  is  omitted  for  the  first  pair  of  images).  For  now,  we 
will  assume  that  each  value  in  the  measured  or  predicted  disparity  map  is  not  correlated  with  its 
neighbors,  so  that  the  map  updating  can  be  done  at  each  pixel  independently.  The  extension  of 
this  model  to  account  for  the  correlated  nature  of  disparity  maps  is  discussed  later. 

To  update  a  pixel  value,  we  first  compute  the  variance  of  the  updated  disparity  estimate 

pX(p;r'  +  ^r'r'  =  -f^K 

Pt  +°d 

and  the  Kalman  filter  gain  K 

v  _  Pt  _  PT 

ad  Pt 

We  then  update  the  disparity  value  by  using  the  Kalman  filter  update  equation 

u*  =  u~  +K(d  -  u~) 

where  and  u*  are  the  predicted  and  updated  disparity  estimates,  and  d  is  the  new  disparity 
measurement.  This  update  equation  can  also  be  written  as 


The  latter  form  shows  that  the  updated  disparity  estimate  is  a  linear  combination  of  the  predicted 
and  measured  values,  inversely  weighted  by  their  respective  variances. 

4.3  Smoothing  the  map 

The  raw  depth  or  disparity  values  obtained  from  optical  flow  measurements  can  be  very  noisy, 
especially  in  areas  of  uniform  intensity.  We  employ  smoothness  constraints  to  reduce  the  noise 
and  to  “fill  in”  underconstrained  areas.  The  earliest  example  of  this  approach  is  that  of  Horn 
and  Schunck  [Hom81].  The  optical  flow  field  (u,  v)  is  smoothed  by  jointly  minimizing  the  error 
in  the  flow  equation 

£b  =  Exu  +  Eyv  +  E, 

(E  is  image  intensity)  and  the  departure  from  smoothness 

£2c  =  |Vw|2  +  |  Vv|2. 

The  smoothed  flow  is  that  which  minimizes  the  total  error 

£2  =  J  j(£2b+a2£2)dxdy 


where  a  is  a  blending  constant.  More  recently,  this  approach  has  been  formalized  using  the  theory 
of  regularization  [Terzopoulos86a]  and  extended  to  use  two-dimensional  confidence  measures 
equivalent  to  local  covariance  estimates  [Anandan85],[Nagel86]. 

For  our  application,  smoothing  is  done  on  the  disparity  field,  using  the  inverse  variance  of  the 
disparity  estimate  as  the  confidence  in  each  measurement.  The  smoother  we  use  is  the  generalized 
piecewise  continuous  spline  under  tension  [Terzopoulos86b]  which  uses  finite  element  relaxation 
to  compute  the  smoothed  field.  The  algorithm  is  implemented  with  a  three-level  coarse-to-fine 
strategy  to  speed  convergence,  and  is  amenable  to  implementation  on  a  parallel  computer. 

After  the  initial  smoothing  has  been  performed,  depth  discontinuities  are  detected  by  thresh¬ 
olding  the  angle  between  the  view  vector  and  the  local  surface  normal  (Appendix  B)  and  doing 
non-maximum  suppression.  This  is  superior  to  applying  edge  detection  directly  to  the  dispar¬ 
ity  image,  because  it  properly  takes  into  account  the  3-D  geometry  and  perspective  projection. 
Once  discontinuities  have  been  detected,  they  are  incorporated  into  the  piecewise  continuous 
smoothing  algorithm,  and  a  few  more  relaxation  steps  are  performed.  Our  approach  to  discon¬ 
tinuity  detection,  which  interleaves  smoothing  and  edge  detection,  is  similar  to  Terzopoulos’ 
continuation  method  [Terzopoulos86b].  The  alternative  of  trying  to  estimate  the  boundaries  in 
conjunction  with  the  smoothing  [Marroquin87]  has  not  been  tried,  but  could  be  implemented 
within  our  framework.  The  detected  discontinuities  could  also  be  propagated  to  the  next  frame, 
but  this  has  not  been  implemented. 

The  smoothing  stage  can  be  viewed  as  the  part  of  the  Kalman  filtering  algorithm  that  incor¬ 
porates  prior  knowledge  about  the  smoothness  of  the  disparity  map.  As  shown  in  [Szeliski87a], 
a  regularization-based  smoother  is  equivalent  to  a  prior  model  with  a  correlation  function  defined 
by  the  degree  of  the  stabilizing  spline  (e.g.  membrane  or  thin  plate).  The  resulting  covariance 
matrix  of  the  disparity  map  contains  off-diagonal  elements  modeling  the  covariance  of  neighbor¬ 
ing  pixels.  An  optimal  implementation  of  the  Kalman  filter  would  require  transforming  (warping) 
the  prior  model  covariance  during  the  prediction  stage,  and  would  significantly  complicate  the 
implementation  of  our  algorithm.  In  practice,  our  current  implementation  which  uses  the  same 
amount  of  smoothing  at  each  step  has  proved  to  be  sufficient. 

4.4  Predicting  the  next  disparity  map 

The  final  step  in  defining  the  filter  is  to  specify  how  the  disparity  estimates  are  extrapolated  from 
the  previous  maps  and  the  motion  estimate.  The  process  must  predict  both  the  new  disparity 
at  each  pixel  in  the  image  and  the  uncertainty  in  disparity.  We  will  describe  the  disparity 
extrapolation  first,  then  consider  the  uncertainty  extrapolation. 

Our  approach  is  illustrated  in  Figure  7.  At  time  t,  the  current  disparity  map  and  motion 
estimate  are  used  to  predict  the  optical  flow  between  images  t  and  t+  1,  which  in  turn  indicates 


Figure  7:  Illustration  of  disparity  prediction  stage 


where  the  pixels  in  frame  t  will  ‘move  to’  in  the  next  frame: 

*/+  t  =  x,  +  Ax, 
yt+i  =  y,  +  Ay,. 

The  flow  estimates  are  computed  with  equation  (2),  assuming  that  Z,  T,  and  R  are  known4.  Next 
we  predict  what  the  new  depth  of  this  point  will  be  using  the  equations  of  motion.  From  (2)  we 
have 


AZ,  =  -Tt-RxY,+RyXt 

—  Tz  Rx  y,  Zf  +  Ry  x ,  Zt 
so  that  the  predicted  depth  atxr+i,y,+i  is 

Z,+\  =  Z,  +  AZ, 

=  (1  -Rxy,  +Ryx,)Z,  -  Tz 
=  a Zt  -  Tz. 


Rewriting  this  in  terms  of  inverse  depth,  we  obtain 


= 


a  -  7X 


(14) 


4There  will  be  uncertainty  in  xl+i  and  y,+i  due  to  uncertainty  in  the  motion  and  disparity  estimates.  We  ignore 
this  for  now. 


In  general  this  prediction  process  will  yield  estimates  of  disparity  in  between  pixels  in  the 
new  image  (Figure  7),  so  we  need  to  resample  to  obtain  predicted  disparity  at  pixel  locations. 
For  a  given  pixei  x'  in  the  new  image,  we  find  the  square  of  extrapolated  pixels  that  overlap  x' 
and  compute  the  disparity  at  x'  by  bi-linear  interpolation  of  the  extrapolated  disparities.  Note 
that  it  may  be  possible  to  detect  occlusions  by  recording  where  the  extrapolated  squares  turn 
away  from  the  camera.  Detecting  “disocclusions”,  where  newly  visible  areas  become  exposed, 
is  not  possible  if  the  disparity  field  is  assumed  to  be  continuous,  but  is  possible  if  disparity 
discontinuities  have  been  detected. 

Uncertainty  will  increase  in  the  prediction  phase  due  to  errors  from  many  sources,  including 
uncertainty  in  the  motion  parameters,  errors  in  calibration,  and  inaccurate  models  of  the  camera 
optics.  A  simple  approach  to  modeling  these  errors  is  to  lump  them  together  by  inflating  the 
current  variance  estimates  by  a  small  multiplicative  factor  in  the  prediction  stage.  Thus,  the 
variance  prediction  associated  with  the  disparity  prediction  of  equation  (14)  is 

Pm*(1+«K-  (15) 

In  the  Kalman  filtering  literature  this  is  known  as  exponential  age-weighting  of  measurements 
[Maybeck79],  because  it  decreases  the  weight  given  to  previous  measurements  by  an  exponential 
function  of  time.  This  is  the  approach  used  in  our  implementation.  We  first  inflate  the  variance 
in  the  current  disparity  map  using  equation  (15),  then  warp  and  interpolate  the  variance  map  in 
the  same  way  as  the  disparity  map.  A  more  exact  approach  is  to  attempt  to  model  the  individual 
sources  of  error  and  to  propagate  their  effects  through  the  prediction  equations.  Appendix  C 
examines  this  for  uncertain  camera  motion. 

5  Feature  based  depth  estimation 

The  dense,  iconic  depth  estimation  algorithm  described  in  the  previous  section  can  be  com¬ 
pared  with  existing  depth  estimation  methods  based  on  sparse  feature  tracking.  Such  methods 
[Ayache87]  [Broida86]  [Hallam83]  [Matthies87b]  typically  define  the  state  vector  to  be  the  pa¬ 
rameters  of  the  3-D  object  being  tracked,  which  is  usually  a  point  or  straight  line  segment.  The 
3-D  motion  of  the  object  between  frames  defines  the  system  model  of  the  filter  and  the  per¬ 
spective  projection  of  the  object  onto  each  image  defines  the  measurement  model.  This  implies 
that  the  measurement  equations  (the  perspective  projection)  are  non-linear  functions  of  the  state 
variables  (e.g.  the  3-D  position  vector);  this  requires  linearization  in  the  update  equations  and 
implies  that  the  error  distribution  of  the  3-D  coordinates  will  not  be  Gaussian.  In  the  case  of 
arbitrary  camera  motion,  a  further  complication  is  that  it  is  difficult  to  reliably  track  features 
between  frames.  In  this  section,  we  will  describe  in  detail  an  approach  to  feature-based  Kalman 
filtering  for  lateral  motion,  which  tracks  edgels  along  each  scanline,  and  avoids  the  problems 


associated  with  non-linear  measurement  equations.  Extensions  to  arbitrary  motion  can  be  based 
on  the  method  presented  here. 

5.1  Kalman  filter  formulation  for  lateral  motion 

Lateral  camera  translation  considerably  simplifies  the  feature  tracking  problem,  since  in  this 
case  features  flow  along  scanlines.  Moreover,  the  position  of  a  feature  on  a  scanline  is  a  linear 
function  of  the  distance  moved  by  the  camera,  since 

Ax  =  Txd  <s>  x,  =  Xq  +  tTxd 

where  Xq  is  the  position  of  the  feature  in  the  first  frame  and  d  is  the  inverse  depth  of  the  feature. 
The  epipolar  plane  image  method  [Bolles87]  exploits  these  characteristics  by  extracting  lines  in 
“space-time”  (epipolar  plane)  images  formed  by  concatenating  scanlines  from  an  entire  image 
sequence.  However,  sequential  estimation  techniques  like  Kalman  filtering  are  a  more  practical 
approach  to  this  problem  because  they  allow  images  to  be  processing  on-line  by  incrementally 
refining  the  depth  model. 

Taking  xo  and  d  as  the  state  variables  defining  the  location  of  the  feature,  instead  of  the  3-D 
coordinates  X  and  Z,  keeps  the  entire  estimation  problem  linear.  This  is  advantageous  because 
it  avoids  the  approximations  needed  for  error  estimation  with  non-linear  equations.  For  point 
features,  if  the  position  of  the  feature  in  each  image  is  given  by  the  sequence  of  measurements 
x  =  [x0,xl,...x„]T,  knowledge  of  the  camera  position  for  each  image  allows  the  feature  location 
to  be  determined  by  fitting  a  line  to  the  measurement  vector  x: 

x  =  H  (16) 

d 

where  H  is  a  (2  x  n  +  1)  matrix  whose  first  column  contains  all  l’s  and  whose  second  column 
is  defined  by  the  camera  position  for  each  frame,  relative  to  the  initial  camera  position.  This  fit 
can  be  computed  sequentially  by  accumulating  the  terms  of  the  normal  equation  solution  for  x0 
and  d.  The  covariance  matrix  E  of  x0  and  d  can  be  determined  from  the  covariance  matrix  of 
the  measurement  vector  x. 

The  approach  outlined  above  uses  the  position  of  the  feature  in  the  first  frame  *o  as  one 
of  the  two  state  variables.  We  can  reformulate  this  in  terms  of  the  current  frame  by  taking  x, 
and  d  to  be  the  state  variables.  Assuming  that  the  camera  motion  is  exact  and  that  measured 
feature  positions  have  normally  distributed  uncertainty  with  variance  a],  the  initial  state  vector 
and  covariance  matrix  are  expressed  in  terms  of  ideal  image  coordinates  as 
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where  Ti  is  the  camera  translation  between  the  first  and  second  frame.  The  covariance  matrix 
comes  from  applying  standard  linear  error  propagation  methods  to  the  equations  for  x\  and  a 
[Maybeck79]. 

After  initialization,  if  T,  is  the  translation  between  frames  t  —  1  and  t,  the  motion  equations 
that  transform  the  state  vector  and  covariance  matrix  to  the  current  frame  are 
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The  superscript  minuses  indicate  that  these  estimates  do  not  incorporate  the  measured  edge 
position  at  time  t.  The  newly  measured  edge  position  x,  is  incorporated  by  computing  the 
updated  covariance  matrix  Pf,  a  gain  matrix  K,  and  the  updated  parameter  vector  u*\ 


P*t  =  {(^  )  1+S}  1  where  5  =  — 


ut  =  u~  +  K[x,  -  x,  ]. 

Since  these  equations  are  linear,  we  can  see  how  uncertainty  decreases  as  the  number  of 
measurements  increases  by  computing  the  sequence  of  covariance  matrices  P„  given  only  the 
measurement  uncertainty  cr^  and  the  sequence  of  camera  motions  Tt.  This  is  addressed  in  Section 
6.1. 

Note  that  the  equations  above  can  be  generalized  to  arbitrary,  uncertain  camera  motion  using 
either  the  x,  y,  d  image-based  parameterization  of  point  locations  or  an  X,  T,  Z  three-dimensional 
parameterization.  The  choice  of  parameterization  may  prove  to  be  an  important  factor  in  the 
success  of  general  depth  from  motion  algorithms,  but  we  have  not  thus  far  addressed  this  question. 

5.2  Feature  extraction  and  matching 

To  implement  the  feature-based  depth  estimator,  we  must  specificy  how  to  extract  feature  posi¬ 
tions,  how  to  estimate  the  noise  level  in  those  positions,  and  how  to  track  features  from  frame 
to  frame.  For  lateral  motion,  with  image  flow  parallel  to  the  scanlines,  tracking  edgels  on  each 
scanline  is  the  most  natural  implementation.  Therefore,  in  this  section  we  will  describe  how  we 
extract  edges  to  sub-pixel  precision,  how  we  estimate  the  variance  of  the  edge  positions,  and 
how  we  track  edges  from  frame  to  frame. 
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For  one-dimensional  signals,  estimating  the  variance  of  edge  positions  has  been  addressed  in 
[Canny86].  We  will  review  this  analysis  before  considering  the  general  case.  In  one  dimension, 
edge  extraction  amounts  to  finding  the  zero  crossings  in  the  second  derivative  of  the  Gaussian- 
smoothed  signal,  which  is  equivalent  to  finding  zero-crossings  after  convolving  the  image  with 
a  second  derivative  of  Gaussian  operator, 


F(x)  = 


d2G{x) 

dx2 


*/(*). 


We  assume  that  the  image  /  is  corrupted  by  white  noise  with  variance  a2.  Splitting  the  response 
of  the  operator  into  that  due  to  the  signal,  Fs,  and  that  due  to  noise,  Fn,  edges  are  marked  where 


Fs(x)+Fn(x)  =  0. 


(19) 


An  expression  for  the  edge  variance  is  obtained  by  taking  a  first-order  Taylor  expansion  of  the 
deterministic  part  of  the  response  in  the  vicinity  of  the  zero  crossing,  then  taking  mean  square 
values.  Thus,  if  the  zero  crossing  occurs  at  Xo  in  the  noise  free  signal  and  x0  +  6x  in  the  noisy 
signal,  we  have 

F(x0  +  Sx )  «  Fs(x0)  +  F's(x0)6x  +  Fn(x0  +  6x)  =  0,  (20) 


so  that 


Sx  = 


-(Fn(x0  +  6x)  +  Fs(xy)) 
F's(xq) 


(21) 


The  presence  of  a  zero  crossing  implies  that  Fs(x 0)  =  0  and  the  assumption  of  zero  mean  noise 
implies  that  E[E„Cto)]  =  0.  Therefore,  the  variance  of  the  edge  position  is 


E[6; c2]  =  a2  = 


a2nE[(Fn(x q))2] 
(F'O Co))2 


(22) 


In  a  discrete  implementation,  £[(En(^o))2]  is  the  sum  of  the  squares  of  the  coefficients  in  the 
convolution  mask.  F's(xo)  is  the  slope  of  the  zero  crossing  and  is  approximated  by  fitting  a  local 
curve  to  the  filtered  image.  The  zero  crossing  of  this  curve  gives  the  es  nate  of  the  sub-pixel 
edge  position. 

For  two-dimensional  images,  an  analogous  edge  operator  is  a  directional  derivative  filter 
with  a  derivative  of  Gaussian  profile  in  one  direction  and  a  Gaussian  profile  in  the  orthogonal 
direction.  Assuming  that  the  operator  is  oriented  to  take  the  derivative  in  the  direction  of  the 
gradient,  the  analysis  above  will  give  the  variance  of  the  edge  position  in  the  direction  of  the 
gradient  (see  [Nalwa86]  for  an  alternate  approach).  However,  for  edge  tracking  along  scanlines, 
we  require  the  variance  of  the  edge  position  in  the  scanline  direction,  not  the  gradient  direction. 
This  is  straightforward  to  compute  for  the  difference  of  Gaussian  {DOG)  edge  operator,  the 
required  variance  estimate  comes  directly  from  equations  (19)  -  (22),  replacing  F  with  the  DOG 


and  F  with  the  partial  derivative  d/dx.  Details  of  the  discrete  implementation  in  this  case  are 
similar  to  those  described  above.  Experimentally,  the  cameras  and  digitizing  hardware  we  use 
provide  8-bit  images  with  intensity  variance  «  4. 

It  is  worth  emphasizing  that  estimating  the  variance  of  edge  positions  is  more  than  a  math¬ 
ematical  nicety;  it  is  valuable  in  practice.  The  uncertainty  in  the  position  of  an  edge  is  affected 
by  the  contrast  of  the  edge,  the  amount  of  noise  in  the  image,  and,  in  matching  applications 
such  as  this  one,  by  the  edge  orientation.  For  example,  in  tracking  edges  under  lateral  motion, 
edges  that  are  close  to  horizontal  provide  much  less  precise  depth  estimates  than  edges  that  are 
vertical.  Estimating  variance  quantifies  these  differences  in  precision.  Such  quantification  is 
important  in  predictive  tracking,  fitting  surface  models,  and  applications  of  depth  from  motion 
to  constraining  stereo.  These  remarks  of  course  apply  to  image  features  in  general,  not  just  to 
edges. 

Tracking  features  from  frame  to  frame  is  very  simple  if  either  the  camera  motion  is  very  small 
or  the  feature  depth  is  already  known  quite  accurately.  In  the  former  case,  a  search  window  is 
defined  that  limits  the  feature  displacement  to  a  small  number  of  pixels  from  the  position  in  the 
previous  image.  For  the  experiments  described  in  Section  6,  tracking  was  implemented  this  way, 
with  a  window  width  of  two  pixels.  Alternatively,  when  the  depth  of  a  feature  is  already  known 
fairly  accurately,  the  position  of  the  feature  in  a  new  image  can  be  predicted  from  equation  (17) 
to  be 

xj  =*,-!  +7X-i, 

the  variance  of  the  prediction  can  be  determined  from  equation  (18),  and  a  search  window  can  be 
defined  as  a  confidence  interval  estimated  from  this  variance.  This  allows  tight  search  windows 
to  be  defined  for  existing  features  even  when  the  camera  motion  is  not  small.  A  simplified 
version  of  this  procedure  is  used  in  our  implementation  to  ensure  that  candidate  edge  matches  are 
consistent  with  the  existing  depth  model.  The  predefined  search  window  is  scanned  for  possible 
matches,  and  these  are  accepted  only  if  they  lie  within  some  distance  of  the  predicted  edge 
location.  Additional  acceptance  criteria  require  the  candidate  match  to  have  properties  similar  to 
those  of  the  feature  in  the  previous  image;  for  edges,  these  properties  are  edge  orientation  and 
edge  strength  (gradient  magnitude  or  zero-crossing  slope).  Given  knowledge  of  the  noise  level 
in  the  image,  this  comparison  function  can  be  defined  probabilistically  as  well,  but  we  have  not 
pursued  this  direction. 

Finally,  if  the  noise  level  in  the  image  is  unknown  it  can  be  estimated  from  the  residuals  of  the 
observations  after  x  and  d  have  been  determined.  Such  methods  are  discussed  in  [Mikhail76]  for 
batch  oriented  techniques  analogous  to  equation  (16)  and  in  [Maybeck82]  for  Kalman  filtering. 
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6  Evaluation 


In  this  section,  we  compare  the  performance  of  the  iconic  and  feature-based  depth  estimation 
algorithms  in  three  ways.  First,  we  perform  a  mathematical  analysis  of  the  reduction  in  depth 
variance  as  a  function  of  time.  Second,  we  use  a  sequence  of  images  of  a  flat  scene  to  determine 
the  quantitative  performance  of  the  two  approaches  and  to  check  the  validity  of  our  analysis. 
Third,  we  test  our  algorithms  on  images  of  realistic  scenes  with  complicated  variations  in  depth. 

6.1  Mathematical  analysis 

We  wish  to  compare  the  theoretical  variance  of  the  depth  estimates  obtained  by  the  iconic  method 
of  Section  4  to  those  obtained  by  the  feature-based  method  of  Section  5.  We  will  also  compare 
the  accuracy  of  both  methods  to  the  accuracy  of  stereo  matching  with  the  first  and  last  frames  of 
the  image  sequence.  To  do  this,  we  will  derive  expressions  for  the  depth  variance  as  a  function 
of  the  number  of  frames  processed,  assuming  a  constant  noise  level  in  the  images  and  constant 
camera  motion  between  frames.  For  clarity,  we  will  assume  this  motion  is  Tx  =  1. 

Iconic  approach 

For  the  iconic  method,  we  will  ignore  process  noise  in  the  system  model  and  assume  that 
the  variance  of  successive  flow  measurements  is  constant.  For  lateral  motion,  the  equations 
developed  in  Section  2  can  be  simplified  to  show  that  the  Kalman  filter  simply  computes  the 
average  flow  [Gelb74],  Therefore,  a  sequence  of  flow  measurements  Ax i,  Ax 2,  ...,  Axt  is 
equivalent  to  the  following  batch  measurement  equation 


Estimating  d  by  averaging  the  flow  measurements  implies  that 

1  1  1 

d  =  -HtA\  =  -  Axt.  (23) 

1  1  ;=i 

If  the  flow  measurements  were  independent  with  variance  2 cr\ja,  where  an  is  the  noise  level  in 
the  image  (Appendix  A),  the  resulting  variance  of  the  disparity  estimate  would  be 


However,  the  flow  measurements  are  not  actually  independent.  Because  noise  is  present  in  every 
image,  flow  measurements  between  frames  i  —  1  and  i  will  be  correlated  with  measurements  for 
frames  i  and  *  +  1.  Appendix  A  shows  that  a  sequence  of  correlation-based  flow  measurements 
that  track  the  same  point  in  the  image  sequence  will  have  the  following  covariance  matrix: 
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where  a 2  is  the  level  of  noise  in  the  image  and  a  reflects  the  local  slope  of  the  intensity  surface. 
With  this  covariance  matrix,  averaging  the  flow  measurements  actually  yields  the  following 
variance  for  the  estimated  flow: 

t  1  r  2er2 

aj(t)  =  -^HTPmH=-^.  (25) 

This  is  interesting  and  rather  surprising.  Comparing  equations  (24)  and  (25),  the  correlation 
structure  that  exists  in  the  measurements  means  that  the  algorithm  converges  faster  than  we  first 
expected. 

With  correlated  measurements,  averaging  the  flow  measurements  in  fact  is  a  sub-optimal 
estimator  for  d.  The  optimal  estimator  is  obtained  by  substituting  the  expressions  for  H  and  Pm 
into  equations  (12)  and  (13).  This  estimator  does  not  give  equal  weight  to  all  flow  measurements; 
instead,  measurements  near  the  center  of  the  sequence  receive  more  weight  than  those  near  the 
end.  The  variance  of  the  depth  estimate  is 

rW  t(t+l)(t  +  2)a' 

The  optimal  convergence  is  cubic,  whereas  the  convergence  of  the  averaging  method  we  im¬ 
plemented  is  quadratic.  Developing  an  incremental  version  of  the  optimal  estimator  requires 
extending  our  Kalman  filter  formulation  to  model  the  correlated  nature  of  the  measurements. 
This  extension  is  currently  being  investigated. 


Feature-based  approach 

For  the  feature  based  approach,  the  desired  variance  estimates  come  from  computing  the  sequence 
of  covariance  matrices  Pt,  as  mentioned  at  the  end  of  Section  5.1.  A  closed  form  expression  for 


this  matrix  is  easier  to  obtain  from  the  batch  method  suggested  by  equation  (16)  than  from  the 
Kalman  filter  formulation  and  yields  an  equivalent  result.  Taking  the  constant  camera  translation 
to  be  Tx  =  1  for  simplicity,  equation  (16)  expands  to 
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Recall  that  1,  are  the  edge  positions  in  each  frame,  x0  is  the  best  fit  edge  position  in  the  first 
frame,  and  d  is  the  best  fit  displacement  or  flow  between  frames.  Since  we  assume  that  the 
measured  edge  positions  x,  are  independent  with  equal  variance  c2,  we  find  that 
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The  summations  can  be  expressed  in  closed  form,  leading  to  the  conclusion  that 
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The  variance  of  the  displacement  or  flow  estimate  d  thus  decreases  as  the  cube  of  the  number  of 
images.  This  expression  is  identical  in  structure  to  the  optimal  estimate  for  the  iconic  approach, 
the  only  difference  being  the  replacement  of  the  variance  of  the  SSD  minimum  by  the  variance 
of  the  edge  position.  Thus,  if  our  estimators  incorporate  appropriate  models  of  measurement 
noise,  the  iconic  and  feature-based  methods  theoretically  achieve  the  same  rate  of  convergence. 
This  is  surprising,  given  that  the  basic  Kalman  filter  for  the  iconic  method  maintains  only  one 
state  parameter  (d)  for  each  pixel,  whereas  the  feature-based  method  maintains  two  per  feature 
(xo  and  d).  We  suspect  that  an  incremental  version  of  the  optimal  iconic  estimator  will  require 
the  same  amount  of  state  as  the  feature-based  method. 


Comparison  with  stereo 

To  compare  these  methods  to  stereo  matching  on  the  first  and  last  frames  of  the  image  sequence, 
we  must  scale  the  stereo  disparity  and  its  uncertainty  to  be  commensurate  with  the  flow  between 
frames.  This  implies  dividing  the  stereo  disparity  by  t  and  the  uncertainty  by  f2.  For  the  iconic 
method,  we  assume  that  the  uncertainty  in  a  stereo  measurement  will  be  the  same  as  that  for  an 
individual  flow  measurement.  Thus,  the  scaled  uncertainty  is 


This  is  the  same  as  is  achieved  with  our  incremental  algorithm  which  processes  all  of  the 
intermediate  frames.  Therefore,  processing  the  intermediate  frames  (while  ignoring  the  temporal 
correlation  of  the  measurements)  may  improve  the  reliability  of  the  matching,  but  in  this  case  it 
does  not  improve  precision. 

For  the  feature-based  approach,  the  uncertainty  in  stereo  disparity  is  twice  the  uncertainty  a] 
in  the  feature  position;  the  scaled  uncertainty  is  therefore 


In  this  case  using  the  intermediate  frames  helps,  since 

_  1 

Thus,  extracting  depth  from  a  small-motion  image  sequence  has  several  advantages  over 
stereo  matching  between  the  first  and  last  frames.  The  ease  of  matching  is  increased,  reducing 
the  number  of  correspondence  errors.  Occlusion  is  less  of  a  problem,  since  it  can  be  predicted 
from  early  measurements.  Finally,  better  accuracy  is  available  by  using  the  feature  based  method 
or  the  optimal  version  of  the  iconic  method. 

6.2  Quantitative  experiments:  flat  images 

The  goals  of  our  quantitative  evaluation  were  to  examine  the  actual  convergence  rates  of  the 
dejpth  estimators,  to  assess  the  validity  of  the  noise  models,  and  to  compare  the  performance  of 
the  iconic  and  feature-based  algorithms.  To  obtain  ground  truth  depth  data,  we  used  the  facilities 
of  the  Calibrated  Imaging  Lab  at  CMU  to  digitize  a  sequence  of  images  of  a  flat-mounted  poster. 
We  used  a  Sony  XC-37  CCD  camera  with  a  16mm  lens,  which  gave  a  field  of  view  of  36 
degrees.  The  poster  was  set  about  20  inches  (51  cm)  from  the  camera.  The  camera  motion 
between  frames  was  0.04  inches  (1  mm),  which  gave  an  actual  flow  of  approximately  two  pixels 
per  frame  in  480x512  images.  For  convenience,  our  experiments  were  run  on  images  reduced 
to  240x256  by  Gaussian  convolution  and  subsampling.  The  image  sequence  we  will  discuss 
here  was  taken  with  vertical  camera  motion.  This  proved  to  give  somewhat  better  results  than 
horizontal  motion;  we  attribute  this  to  jitter  in  the  scanline  clock,  which  induces  more  noise  in 
horizontal  flow  than  in  vertical  flow. 

Figure  8  shows  the  poster  and  the  edges  extracted  from  it.  For  both  the  iconic  and  the 
feature-based  algorithms,  a  ground  truth  value  for  the  depth  was  determined  by  fitting  a  plane 
to  the  measured  values.  The  level  of  measurement  noise  was  then  estimated  by  computing 
the  RMS  deviation  of  the  measurements  from  the  plane  fit.  Optical  aberrations  made  the  flow 
measurements  consistently  smaller  near  the  periphery  of  the  image  than  the  center,  so  the  RMS 
calculation  was  performed  over  only  the  center  quarter  of  the  image.  Note  that  all  experiments 
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Figure  8:  Tiger  image  and  edges 


described  in  this  section  did  not  use  regularization  to  smooth  the  depth  estimates,  so  the  results 
show  only  the  effect  of  the  Kalman  filtering  algorithm. 

To  determine  the  reliability  of  the  flow  variance  estimates,  we  grouped  flow  measurements 
produced  by  the  SSD  algorithm  according  to  their  estimated  variances,  took  sample  variances 
over  each  group,  and  plotted  the  SSD  variance  estimates  against  the  sample  variances  (Figure 
9).  The  strong  linear  relationship  indicates  fairly  reliable  variance  estimates.  The  deviation  of 
the  slope  of  the  line  from  the  ideal  value  of  1  is  due  to  an  inaccurate  estimate  of  the  image  noise 

( *2  )• 

To  examine  the  convergence  of  the  Kalman  filter,  the  RMS  depth  error  was  computed  for 
the  iconic  and  the  feature-based  algorithms  after  processing  each  image  in  the  sequence.  We 
computed  two  sets  of  statistics,  one  for  “sparse”  depth  and  one  for  “dense”  depth.  The  sparse 
statistic  computes  the  RMS  error  for  only  those  pixels  where  both  algorithms  gave  depth  estimates 
(that  is,  where  edges  were  found),  whereas  the  dense  statistic  computes  the  RMS  error  of  the 
iconic  algorithm  over  the  full  image.  Figure  10  plots  the  relative  RMS  errors  as  a  function  of 
the  number  of  images  processed.  Comparing  the  sparse  error  curves,  the  convergence  rate  of 
the  iconic  algorithm  is  slower  than  the  feature-based  algorithm,  as  expected.  In  this  particular 
experiment,  both  methods  converged  to  an  error  level  of  approximately  0.5%  percent  after 
processing  eleven  images.  Since  the  poster  was  20  inches  from  the  camera,  this  equates  to  a 
depth  error  of  0.1  inches.  Note  that  the  overall  baseline  between  the  first  and  the  eleventh  image 
was  only  0.44  inches. 

To  compare  the  theoretical  convergence  rates  derived  earlier  to  the  experimental  rates,  the 
theoretical  curves  were  scaled  to  coincide  with  the  experimental  error  after  processing  the  first  two 


* i  -'4  kt.  •  *  WV  V  s1  V  V 


i 

k 


1 


I 


m 


i 


'53 


c 


i 


( 


i 


£ 


3 


S3 


1 


m 


JS  0.20 

>< 

‘a. 


0.05  0.10  0.15  0.20 

Estimated  standard  deviation  (pixels) 

Figure  9:  Scatter  plot 


frames.  These  scaled  curves  arc  also  shown  in  Figure  10.  For  the  iconic  method,  the  theoretical 
rate  plotted  is  the  quadratic  convergence  predicted  by  the  correlated  flow  measurement  model. 
The  agreement  between  theory  and  practice  is  quite  good  for  the  first  three  frames.  Thereafter,  the 
experimental  RMS  error  decreases  more  slowly;  this  is  probably  due  to  the  effects  of  unmodeled 
sources  of  noise.  For  the  feature-based  method,  the  experimental  error  initially  decreases  faster 
than  predicted  because  the  implementation  required  new  edge  matches  to  be  consistent  with  the 
prior  depth  estimate.  When  this  requirement  was  dropped,  the  results  agreed  very  closely  with 
the  expected  convergence  rate. 

Note  that  the  comparison  between  theoretical  and  experimental  results  also  allows  us  to 
estimate  the  precision  of  the  sub-pixel  edge  extractor.  The  variance  of  a  disparity  estimate  is 
twice  the  variance  of  the  edge  positions.  Since  the  frame-to-frame  displacement  in  this  image 
sequence  was  one  pixel  and  the  relative  RMS  error  was  12%  for  the  first  disparity  estimate,  the 
RMS  error  in  edge  localization  was  0.12/\/2  ~  0.09  pixels. 

Finally,  Figure  10  also  compares  the  RMS  error  for  the  sparse  and  dense  depth  estimates 
from  the  iconic  method.  The  dense  flow  field  is  considerably  noisier  than  the  flow  estimates  that 
coincide  with  edges,  though  still  just  over  two  percent  error  by  the  end  of  eleven  frames.  Some 
of  this  error  is  due  to  a  systematic  bias  produced  by  the  SSD  flow  estimator  in  the  vicinity  of 
ramp  edges. 

Figure  lib  shows  the  intensity  profile  of  a  vertical  slice  taken  through  the  test  bar  image 
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(Figure  11a).  As  can  be  seen,  the  disparity  estimate  (Figure  11c)  is  biased  low  (away  from  the 
“true”  value  in  the  central  flat  part)  on  one  side  of  the  discontinuity,  and  biased  “high”  on  the 
other.  This  bias  can  also  be  confirmed  by  using  an  analytic  model  of  a  ramp  edge.  Fortunately, 
the  variance  estimates  (Figure  lid)  reflect  this  larger  error,  so  regularization-based  smoothing 
can  compensate  for  this  systematic  error.  We  conclude  that  the  dense  depth  estimates  do  provide 
fairly  good  depth  information. 

6.3  Qualitative  experiments:  real  scenes 

We  have  tested  the  iconic  and  edge-based  algorithms  on  complicated,  realistic  scenes  obtained 
from  the  Calibrated  Imaging  Laboratory.  Two  sequences  of  ten  images  were  taken  with  camera 
motion  of  0.05  inches  (1.27mm)  between  frames;  one  sequence  moved  the  camera  vertically, 
the  other  horizontally.  The  overall  range  of  motion  was  therefore  0.5  inches  (1.27  cm);  this 
compares  with  distances  to  objects  in  the  scene  of  20  to  40  inches  (5 1  to  102  cm). 

Figure  12  shows  one  of  the  images  (a  picture  of  a  miniature  town).  Figures  13a-d  show 
a  reduced  version  of  the  image,  the  edges  extracted  from  it  with  an  oriented  Canny  operator 
[Canny86],  and  depth  maps  produced  by  applying  the  iconic  algorithm  to  the  horizontal  and 
vertical  image  sequences,  respectively.  Lighter  areas  in  the  depth  maps  are  nearer.  The  main 
structure  of  the  scene  is  recovered  quite  well  in  both  cases,  :hough  the  results  with  the  horizontal 
sequence  are  considerably  more  noisy.  This  is  most  likely  due  to  scanline  jitter,  as  mentioned 
earlier.  Edges  oriented  parallel  to  the  direction  of  flow  cause  some  scene  structure  to  be  observ¬ 
able  in  one  sequence  but  not  the  other.  This  is  most  noticeable  near  the  center  of  the  scene, 
where  a  thin  vertical  object  appears  in  Figure  13c  but  is  not  visible  in  Figure  13d.  This  object 
corresponds  to  an  antenna  on  the  top  of  a  foreground  building  (Figure  13a).  In  general,  motion 
in  orthogonal  directions  will  yield  more  information  than  motion  in  any  single  direction. 

Figure  14  shows  intensity-coded  depth  maps  and  3-D  perspective  reconstructions  obtained 
with  both  the  iconic  and  feature-based  methods.  These  results  were  produced  by  combining 
disparity  estimates  from  both  horizontal  and  vertical  camera  motion.  The  depth  map  for  the 
feature-based  approach  was  produced  from  the  sparse  depth  estimates  by  regularization.  It  is 
difficult  to  make  quantitative  statements  about  the  performance  of  either  method  from  this  data, 
but  qualitatively  it  is  clear  that  both  recover  the  structure  of  the  scene  quite  well. 

The  iconic  algorithm  was  also  used  to  extract  occluding  boundaries  from  the  depth  map  of 
Figure  13c  (iconic  method  with  vertical  camera  motion).  We  first  computed  an  intrinsic  “grazing 
angle”  image  giving  the  angle  between  the  view  vector  through  each  pixel  and  the  normal  vector 
of  the  local  3-D  surface.  Edge  detection  and  thresholding  were  applied  to  this  image  to  find 
pixels  where  the  view  vector  and  the  surface  normal  were  nearly  perpendicular.  The  resulting 
boundaries  are  shown  along  with  the  depth  map  in  Figure  15.  The  method  found  most  of  the 
prominent  building  outlines  and  the  outline  of  the  bridge  in  the  upper  left. 
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Figure  14:  CIL  orthogonal  motion  results 


la)  iconic  method  depth  map  (b)  perspective  view  (c)  feature-based  method  depth  map  (d) 

perspective  view 

33 


*  A  A  *  jficA  -m.  Aa  ,*fw.  ft 


Figure  17:  CIL-2  orthogonal  motion  results 


(a)  iconic  method  depth  map  (b)  perspective  view  (c)  feature-based  method  depth  map  (d) 

perspective  view 
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Figures  16  and  17  show  the  results  of  our  algorithms  on  a  different  model  set  up  in  the 
Calibrated  Imaging  Laboratory.  The  same  camera  and  camera  motion  were  used  as  before.  Figure 
16  shows  the  first  frame,  the  extracted  edges,  and  the  depth  maps  obtained  from  horizontal  and 
vertical  motion.  Figure  17  shows  the  depth  maps  and  the  perspective  reconstructions  obtained 
with  the  iconic  and  feature-based  methods.  Again,  the  algorithms  did  a  good  job  in  recovering 
the  structure  of  the  scene. 

Finally,  we  present  the  results  of  using  the  first  10  frames  of  the  image  sequence  used  in 
[Bolles87].  Figure  18  shows  the  first  frame  from  the  sequence,  the  extracted  edges,  and  the  depth 
maps  obtained  from  running  the  iconic  and  feature-based  algorithms.  As  expected,  the  results 
from  using  the  feature-based  method  are  similar  to  those  obtained  widi  the  Epipolar-Plane  Image 
technique.  The  iconic  algorithm  produces  a  denser  estimate  of  depth  than  is  available  from 
either  edge-based  technique.  These  results  show  that  the  sparse  (edge-based)  batch  processing 
algorithm  for  small  motion  sequences  introduced  in  [Bolles87]  can  be  extended  to  use  dense 
depth  maps  and  incremental  processing. 

7  Conclusions 

This  paper  has  presented  a  new  algorithm  for  extracting  depth  from  known  motion.  The  algorithm 
processes  an  image  sequence  taken  with  small  inter-frame  displacements  and  produces  an  on-line 
estimate  of  depth  that  is  refined  over  time.  The  algorithm  produces  a  dense,  iconic  depth  map 
and  is  suitable  for  implementation  on  parallel  architectures. 

The  on-line  depth  estimator  is  based  on  Kalman  filtering.  A  correlation-based  flow  algorithm 
measures  both  the  local  displacement  at  each  pixel  and  the  confidence  (or  variance)  of  the 
displacement.  These  two  “measurement  images”  are  integrated  with  predicted  depth  and  variance 
maps  using  a  weighted  least  squares  technique  derived  from  the  Kalman  filter.  Regularization- 
based  smoothing  is  used  to  reduce  the  noise  in  the  flow  estimates  and  to  fill  ir>  areas  of  unknown 
disparity.  The  current  maps  are  extrapolated  to  the  next  frame  by  image  warping,  using  the 
knowledge  of  the  camera  motion,  and  are  resampled  to  keep  the  maps  iconic. 

The  algorithm  has  been  implemented,  evaluated  mathematically  and  experimentally,  and  com¬ 
pared  with  a  feature-based  algorithm  that  uses  Kalman  filtering  to  estimate  the  depth  of  edges. 
The  mathematical  analysis  shows  that  the  iconic  approach  will  have  a  slower  convergence  rate 
because  it  only  keeps  one  element  of  state  per  pixel  (the  disparity),  while  the  feature-based 
approach  keeps  both  the  disparity  and  the  sub-pixel  position  of  the  feature.  However,  an  optimal 
implementation  of  the  iconic  method  (which  takes  into  account  temporal  correlations  in  the  mea¬ 
surements)  has  the  potential  to  equal  the  convergence  rate  and  accuracy  of  the  symbolic  method 
Experiments  with  images  of  a  flat  poster  have  confirmed  this  analysis  and  given  quantitative 
measures  of  the  performance  of  both  algorithms.  Finally,  experiments  with  images  of  a  realistic 
outdoor  scene  model  have  shown  that  the  new  algorithm  performs  well  on  images  with  large 
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Figure  18:  SRI  EPI  sequence  results 

(a)  first  frame  (b)  edges  (c)  iconic  method  depth  map  (d)  feature-based  method  depth  map 


variations  in  depth  and  that  occluding  boundaries  can  be  extracted  from  the  resulting  depth  maps. 
Extensions 

The  algorithms  described  in  this  paper  can  be  extended  in  several  ways.  The  most  straightforward 
extension  is  to  the  case  of  non-lateral  motion.  As  sketched  in  Section  4,  this  can  be  accomplished 
by  designing  a  correlation- based  flow  estimator  that  produces  two-dimensional  flow  vectors  and 
an  associated  covariance  matrix  estimate  [Anandan85].  This  approach  can  also  be  used  when  the 
camera  motion  is  uncertain,  or  when  the  camera  motion  is  variable  (e.g.  for  widening  baseline 
stereo  [Xu85]).  The  alternative  of  searching  only  along  epipolar  lines  during  the  correlation 
phase  may  be  easier  to  implement,  but  is  less  general. 

More  research  is  required  into  the  behavior  of  the  correlation  based  flow  and  confidence  esti¬ 
mator.  In  particular,  we  have  observed  that  our  current  estimator  produces  biased  estimates  in  the 
vicinity  of  intensity  step  edges.  The  correlation  between  spatially  adjacent  flow  estimates,  which 
is  currently  ignored,  should  be  integrated  into  the  Kalman  filter  framework.  More  sophisticated 
representations  for  the  intensity  and  depth  fields  are  also  being  investigated  [Szeliski87b]. 

Finally,  the  incremental  depth  from  motion  algorithms  which  we  have  developed  can  be  used 
to  initiate  stereo  fusion.  Work  is  currently  in  progress  investigating  the  integration  of  depth- 
ffom -motion  and  stereo  [Matthies87a].  We  believe  that  the  framework  presented  in  this  paper 
will  prove  to  be  useful  for  integrating  information  from  multiple  visual  sources  and  for  tracking 
such  information  in  a  dynamic  environment. 
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A  Optic  flow  computation 


In  this  appendix,  we  will  analyze  the  performance  of  a  simple  correlation  based  flow  estimator, 
the  sum  of  squared  differences  (SSD)  estimator  [Anandan85].  This  estimator  selects  at  each 
pixel  the  disparity  which  minimizes  the  SSD  measure 


e(d;x)  =  J  w(\)[fi(x  +  d  +  \)-f0(x  +  \)]2d\, 


where  fo(x)  and/i(x)  are  the  two  successive  image  frames,  and  w(x)  is  a  symmetric,  non-negative 
weighting  function.  To  analyze  its  performance,  we  will  assume  that  the  two  image  frames  are 
generated  from  an  underlying  true  intensity  image,  f(x),  to  which  uncorrelated  (white)  Gaussian 
noise  with  variance  a 2  has  been  added: 


fo(x)  =f(x)  +  no(x), 
fi(x  +  d)  =f(x)  +  n  i(x). 


Using  this  model,  we  can  rewrite  the  error  measure  as5 


e(d; x)  =  J  w(X)\f(x  +  d—  d  +  A)  —  f(x  +  A)  +  n\(x  +  A)  —  rto(x  +  A)]2 


If  d  ~  d,  we  can  use  a  Taylor  series  expansion  to  obtain 


c(d;x)  =  J  W(\)[f(x  +  \)](d-d)2+2W(\Y(x  +  \) 

[«i(x  +  A)~  /!o(x  + A)](S-  cO  w(A)[«x(^  •+•  A)  —  rio(x  +  X)]2dX 


=  a(x)(d  —  df  +  2[b\(x)  -  bQ{x)]{d  -  d)  +  c( x), 


where 


a(x)  =  J  w(X)\f(x  +  X)]2dX , 
biix)  =  J  w(X)f(x  +  X )m(x  +  X)dX, 
c(x)  =  J  vKAX/qfx  +  A)  -  no(x  +  A)]2  dX. 


The  four  coefficients  a(x),  b0(x),  b\{x)  and  c(x)  define  the  shape  of  the  error  surface  e(d;  x). 
The  first  coefficient,  a(x),  is  related  to  the  average  “roughness”  or  “slope”  of  the  intensity  surface. 


5This  equation  is  actually  incorrect,  since  it  should  contain  n\(x  +  d-  d  +  X)  instead  of  n\{x  +  X).  The  effect  of 
including  the  correct  term  is  to  add  small  random  terms  involving  integrals  of  w(A),  w1  (X),  f  (x  +  X),  f"  (x  +  X)  and 
rti(x)  to  the  quadratic  coefficient  a(x),  b\(x)  and  c(x)  that  are  derived  below.  This  intentional  omission  has  been 
made  to  simplify  the  presentation. 
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and  determines  the  confidence  given  to  the  disparity  estimate  (see  below).  The  second  and  third 
coefficients,  bo(x)  and  b\(x),  are  independent  zero  mean  Gaussian  random  variables  that  determine 
the  difference  between  d  and  d,  i.e.  the  error  in  flow  estimator.  The  fourth  coefficient,  c(x),  is 
a  chi-squared  distributed  random  variable  with  mean  (2cr2  f  w(X)dX),  and  defines  the  computed 
error  at  d  =  d. 

To  estimate  the  disparity  at  point  jc  given  the  error  surface  e(d\  x),  we  find  the  d  such  that 

e(d;  x)  =  min  (d;  x). 

d 

From  the  above  quadratic6  equation,  we  can  compute  d(x)  as 


d(x)  =d  + 


b0(x)-bi(x) 


To  calculate  the  variance  in  this  estimate,  we  must  first  calculate  the  variance  in  bi(x). 


Var(b,(x))=<T2n  J  w2(X)\f(x  +  X)]2dX. 


If  we  set  w(jc)  =  1  on  some  finite  interval,  and  zero  elsewhere,  this  variance  reduces  to  a2a(x), 
and  we  obtain 

»  2a2 

Var  (d)  =  — 
a(x) 

In  addition  to  calculating  the  disparity  estimate  variance,  we  can  compute  its  covariance  with 
other  estimates  either  in  the  same  frame  or  in  a  subsequent  frame.  As  described  in  Section  6.1, 
knowing  the  correlation  between  adjacent  or  successive  measurements  is  important  in  obtaining 
good  overall  uncertainty  estimates. 

To  determine  the  correlation  between  two  adjacent  disparity  estimates,  d(x)  and  d(x  +  Ax), 
we  must  first  determine  the  correlation  between  bi(x)  and  b,(x  +  Ax), 

(bi(x)bt(x  +  Ax))  =  J  J  w(\)w(i])f(x  +  \)f  (x  +  Ax  +  rj)  (m(x  +  A )m(x  +  Ax  +  rj))  dX  dr] 

=  J  J  w(A )w(j])f{x  +  A )f  ( x  +  Ax  +  t])8(\  —  Ax  —  r])<j2  d\  dr] 

=  a\  J  w(A)w(A  —  Ax)[f  (x  +  A)]2  dX. 

For  a  slowly  varying  gradient  fix),  this  correlation  is  proportional  to  the  autocorrelation  of  the 
weighting  function, 

Rw(Ax)  =  J  w(A)w(A  +  Ax)dX. 

6The  true  equation  (when  higher  order  Taylor  series  terms  are  included)  is  a  polynomial  series  in  (d-d)  with 
random  coefficients  of  decreasing  variance.  This  explains  the  “rough”  nature  of  the  e(d\x)  observed  in  practice. 


44 


For  the  simple  case  of  w(x)  =  1  on  [-5,5],  we  obtain 


f 

I'O 


Rj(x,x  +  Ax)  =  ^(1  -  j^-)  for  \x\  <  2s. 
a(x)  2s 

The  correlation  between  two  successive  measurements  in  time  is  easier  to  compute.  Since 

f2(x  +  2d)  =f(x)  +  n2(x), 


we  can  show  that  the  flow  estimate  obtained  from  the  second  pair  of  frames  is 

N  j  ,  b2(x)  -  bdx) 

d2(x)  =  d+ - — - . 

a(x) 


The  covariance  between  d\(x)  and  d2(x)  is 

2 

Co\(d\ (x),  d2(x))  =  ({dx{x)  -  d){d2(x)  -  d))  =  -  ^ 

and  the  covariance  matrix  of  the  sequence  of  measurements  d,  is 

'2-1 
-1  2  -1 

<t2  -1 

Pm=- 
a 

2  -1 
-1  2  . 

This  structure  is  used  in  Section  6. 1  to  estimate  the  theoretical  accuracy  and  convergence  rate 
of  the  iconic  depth  from  motion  algorithm. 

B  Three-dimensional  discontinuity  detection 

To  calculate  discontinuity  in  the  depth  map,  we  compute  the  angle  between  the  local  normal 
N  and  the  view  vector  V.  The  surface  normal  at  pixel  value  (r,  c)  is  computed  by  using  the  3-D 
locations  of  the  three  points 


c  —  Cr 


P0  =  CY0,  T0,  Zo)  =  (x0,  yo,  1)3-  where  xQ  = - 

do  $x 


,yo  =  -■ 


P1=(X1,T1,Z1)  =  (x,,yi,l)-i 

d\ 

P2  =  (Xz,  Y2,  Z2)  =  (x2,  y2, 1 )— 

d2 


C  +  1  -  Cv 


=  x0  +  -,y  1  =  -■ 


c  -  cx  r  +  1  -  cv  1 

x2  =  _ —  =  x0,y2  = - -  =>’0 - • 


mmmssm 


We  can  obtain  the  normal  from  the  cross  product  of  the  two  vectors 


Qi  =  P,  -  P0  =  Tx(——+x 0(- — — — ),  (— — ~r)) 

sxd\  di  do  di  do  dt  do 

T  do 

=  -j-V( - x0Au-y0Au-Ax)  where  Ax  =  dx  -  do 

dod\  sx 

Q2  =  P2  -  Po  =  Tx(x o(— - — +yo(- — -rX  (-7- --7-)) 

dz  do  syd2  dz  do  dz  do 

Tz  do 

(-XqAz, - yo^2,  -^2)  where  Az  =  d2-  d0 


dodz 


r%  „  r*  /  doAx  doAz  _  xodoA\  y0doAzs 

VZl  X  >£2  0^  V  1  1  "r  )■ 


SXSy 


Simplifying,  we  obtain 


N  =  (-SxA,  ^2,  -do  +  x0sxAx  -  y0syAz) 
V  =  (jc0,y0, 1) 

N-V  =  -do 
N-  V 


cosS  =  iNiivr 

To  implement  the  edge  detector,  we  require  that 

cos#  <  COS0, 


or 

(s%Ai2  +  s2A22  +  (~d0  +XoSxAx  -  y0syAz)2)(x l  +  yg  +  1)  >  dosec20(. 

If  the  field  of  view  of  the  camera  is  small,  we  have  near  orthographic  projection,  and  the 
above  equations  simplify  to 


N 

V 


sxA  1  syAz 
1  do  '  do  ’ 

(xo,yo,  l) 


-l)  =  (p,q,-l) 


and  this  reduces  to  the  familiar  gradient-based  threshold 


C  Prediction  equations 

To  predict  the  new  disparity  map  and  variance  map  from  the  current  maps,  we  will  first  map 
each  pixel  to  its  new  location  and  value,  and  then  use  interpolation  to  resample  the  map.  For 
simplicity,  the  development  given  here  only  shows  the  one-dimensional  case,  i.e.  disparity  d  as 
a  function  of  x.  The  extension  to  two  dimensions  is  straightforward. 

The  motion  equations  for  a  point  in  the  pixel  map  (x,  d)  are 

x?  =  x  +  txd  +  rx 
d  =  d  +  tz. 

We  will  assume  that  the  points  which  define  the  patch  under  consideration  have  the  same  tx,  rx, 
and  t2  values.  These  three  parameters  are  actually  stochastic  variables,  due  to  the  uncertainty 
in  camera  motion.  For  the  lateral  motion  case,  we  assume  that  the  mean  of  tx  is  known  and 
non-zero,  while  the  means  of  rx  and  tz  are  zero. 

We  can  write  the  vector  equations  for  the  motion  of  the  points  in  a  patch  as 

x!  =  x  +  rxd  +  rxe 
d'  =  d  +  tze 

where 


x  ~  N(x,r,),  tx  ~  N(4,  ajx),  r*  ~  N(0,  cr?x) 
d  ~  N(d,  Ed),  rz  ~  N(0,  aj),  and  e  =  [1 ...  if. 

The  Jacobian  of  this  vector  equation  is 

d(x', dQ  _  I  txl  d  e  0|T 
5(x,  d,  tx,  rx,tz)  0  I  0  0  e 

and  the  variance  of  the  predicted  points  is 


Var(x'.  d')  = 


,  Ex  +  %Ed  +  dd  Taft  +  eer(7^ 

j  txEd 


Ed  +  eeTaj 


To  obtain  the  new  depth  and  variance  at  a  point  x,  we  must  define  an  interpolation  function 
for  the  patch  surrounding  this  point.  For  a  linear  interpolant,  the  equation  is 


(gj+l  ~_X)  ,  (X  -  Xj) 

‘(xl+l-xt)  1+1  (JtI+1  -  x,) 

=  (1  -  A  )di  +  Ad,+i,  where  A  = 

(xM  -  Xi) 


dd_ 

ddi 

dd_ 

dxi 


(xi+l  -  X ) 


=  (1  -  A) 


(xi+l  -  Xi) 

(di+ 1  -  di)(xi+ 1  -  xi) 


=  —m(  1  —  A),  where  = 


~  4) 

(Xi+ 1  -  jc«  ) 


(JCh-1  -  *,)2 

and  the  associated  Jacobian  is 

— - di~,  -  =  [  —m{\  -  A)  -mX  (1  -  A)  A 

d(Xi,xl+x,dt,dt+x)  L 

The  variance  of  the  new  depth  estimate  is  thus 

Var id)  -  m\il-X)2al  +  \2*lj  +  il-txm)2[il-\)2al  +  \za2diJ 
+  m2[cfa2x  +  cr2]  +  cr2. 


Each  of  the  above  four  terms  can  be  analyzed  separately.  The  first  term  in  the  above  equation, 
which  involves  <t2,  depends  on  the  positional  uncertainty  of  the  points  in  the  old  map.  It 
can  either  be  ignored  (if  each  disparity  element  represents  the  disparity  at  its  center ),  or  a2 
can  be  set  to  The  second  term  is  a  blend  of  the  variances  at  the  two  endpoints  of  the 
interpolated  interval.  Note  that  for  A  =  i,  the  variance  is  actually  reduced  by  half  (the  average 
of  two  uncertain  measurements  is  more  certain).  It  may  be  desirable  to  use  a  pure  blend 
((1  -  A)cr^  +A<7^>i)  to  eliminate  this  bias.  The  second  term  also  encodes  the  interaction  between 
the  disparity  uncertainty  and  the  disparity  gradient  m.  The  third  term  encodes  the  interaction 
between  the  disparity  gradient  and  the  camera  translation  and  pan  uncertainty.  The  final  term  is 
the  uncertainty  in  camera  forward  motion,  which  should  in  practice  be  negligible. 


